On the Myopic Policy for a Class of Restless Bandit Problems
with  Applications  in  Dynamic  Multichannel  Access 

Keqin  Liu  and  Qing  Zhao 


Abstract 

We consider a class of restless multi-armed bandit problems that arises in multi-channel
opportunistic communications, where channels are modeled as independent and stochastically
identical Gilbert-Elliot channels and channel state observations are subject to errors. We show
that the myopic channel selection policy has a semi-universal structure that obviates the need
to know the Markovian transition probabilities of the channel states. Based on this
semi-universal structure, we establish closed-form lower and upper bounds on the maximum
throughput (i.e., average reward) achieved by the myopic policy. Furthermore, we characterize
the approximation factor of the myopic policy by considering a genie-aided system.


Index  Terms 
I. Introduction

A.  Dynamic  Multichannel  Access 

We consider the following stochastic optimization problem that arises in multichannel
opportunistic communications. Assume that there are $N$ independent and stochastically identical
Gilbert-Elliot channels [1]. As illustrated in Fig. 1, the state of a channel, "good" or "bad",
indicates the desirability of accessing this channel and determines the resulting reward.
The transitions between these two states follow a discrete-time Markov chain with transition
probabilities $\{p_{ij}\}_{i,j \in \{0,1\}}$. This channel model has been commonly used to abstract physical
channels  with  memory.  Consider,  for  example,  the  emerging  application  of  cognitive  radios  for 
opportunistic  spectrum  access  where  secondary  users  search  in  the  spectrum  for  idle  channels 
temporarily  unused  by  primary  users  [2].  For  this  application,  the  good  state  represents  an 
idle  channel  while  the  bad  state  an  occupied  channel.  When  the  primary  network  employs 
load  balancing  across  channels,  the  occupancy  processes  of  all  channels  can  be  considered 
stochastically  identical. 

In  each  time  slot,  a  user  chooses  M  out  of  the  N  channels  to  sense  and  subsequently  access 
channels  sensed  to  be  in  the  good  states.  Sensing  is  subject  to  errors:  a  good  channel  may  be 
sensed  as  bad  and  vice  versa.  Accessing  a  good  channel  results  in  a  unit  reward,  and  no  access 
or  accessing  a  bad  channel  leads  to  zero  reward.  The  design  objective  is  the  optimal  sensing 
policy  for  dynamic  channel  selection  in  order  to  maximize  the  expected  long-term  reward. 


Fig. 1. The Gilbert-Elliot channel model.


B.  Restless  Multi-armed  Bandit  and  Myopic  Policy 

This problem can be formulated as a partially observable Markov decision process (POMDP)
for generally correlated channels [3], or a restless multi-armed bandit process (RMBP) for the
independent channels considered here. The maximum throughput of the multi-channel opportunistic
system is essentially the maximum long-term expected average reward, or the time-normalized
value function, of an RMBP. Unfortunately, obtaining optimal solutions to a general restless
bandit process is PSPACE-hard [4], and analytical characterization of the performance of the
optimal policy is often intractable.

We  thus  focus  on  the  low-complexity  myopic  policy  which  has  been  shown  to  be  optimal  for 
this  class  of  restless  bandit  problems  under  certain  conditions  (see  Sec.  I-C).  Specifically,  we 
establish a semi-universal structure of the myopic policy and characterize its performance and
approximation  factor  as  detailed  below. 

1) Structure of the Myopic Policy: We first show that the myopic policy has a semi-universal
structure under the condition that the probability of false alarm of the channel state detector
is below a certain value. This structure reveals that the myopic policy does not require the
knowledge of the transition probabilities of the Markovian channel model except the order of
$p_{11}$ and $p_{01}$.

2) Performance of the Myopic Policy: Based on the semi-universal structure of the myopic
policy, we develop closed-form lower and upper bounds on the steady-state throughput under the
myopic policy that monotonically tighten as the number $N$ of channels increases. When each
channel is positively correlated ($p_{11} > p_{01}$), we further obtain the limiting performance of the
myopic policy as $N$ approaches infinity.

3) Approximation Factor of the Myopic Policy: By considering a genie-aided system, we
develop an upper bound on the optimal performance, which provides a performance benchmark
for the myopic policy. This result, coupled with the lower bound on the performance of the
myopic policy, leads to an analytical characterization of the approximation factor of the myopic
policy. Specifically, we show that the myopic policy achieves at least $\frac{M}{N}$ of the optimal
performance when channels are positively correlated, and $\max\{\frac{1}{2}, \frac{M}{N}\}$ of the optimal performance
when channels are negatively correlated ($p_{11} < p_{01}$).

C.  Related  Work 

1) Perfect State Observation: Under the assumption of single-channel perfect sensing, the
semi-universal structure of the myopic policy has been established for all $N$, and the optimality
of the myopic policy was proved for $N = 2$ and conjectured for $N > 2$ in [5]. Furthermore,
closed-form bounds on the throughput under the myopic policy have been established. A recent
follow-up work [6] has extended the optimality of the myopic policy to all $N$ under the condition
$p_{11} > p_{01}$.

For  independent  and  non-identical  channels  under  multi-channel  perfect  sensing,  Whittle’s 
index  policy  under  both  discounted  and  average  reward  criteria  has  been  established  in  [7]. 
An  efficiently  computable  upper  bound  on  the  optimal  performance  has  been  established  based 
on  Whittle’s  relaxation.  Numerical  results  have  illustrated  the  strong  performance  of  Whittle’s 
index policy. For independent and identical channels, Whittle’s index policy has been shown to be
equivalent  to  the  myopic  policy.  The  structure  of  the  myopic  policy  under  multi-channel  sensing 
has  been  established,  and  the  myopic  policy  has  been  shown  to  be  optimal  when  M  =  N  —  1. 
Furthermore,  an  approximation  factor  of  the  myopic  policy  has  been  developed  for  general  M 
and  N.  Interestingly,  the  approximation  factor  we  establish  in  this  paper  coincides  with  the  one 
obtained in [7].

2) Imperfect State Observation: Under imperfect sensing, the design of multi-channel
opportunistic access was addressed in [8] under a general correlated channel model. This problem
requires the joint design of the channel state detector, the access policy, and the sensing policy. A
separation principle has been established which decouples the design of the channel state detector
and access policy from that of the channel sensing policy. The design of the channel sensing policy
then reduces to an unconstrained POMDP. In [9], the structure and optimality of the myopic sensing
policy have been established under certain conditions for independent and identical channels.
Specifically, under single-channel sensing, a simple and robust round-robin structure of the myopic
policy has been established when the false alarm probability of the channel state detector is below
a certain value. Based on this structure, the myopic policy has been shown to be optimal for $N = 2$.
In this paper, we extend the structure of the myopic policy to multi-channel sensing scenarios.

II.  Problem  Formulation 

A.  System  Model 

Let $\mathbf{S}(t) = [S_1(t), \ldots, S_N(t)]$ denote the channel states, where $S_n(t) \in \{0 \text{ (bad)}, 1 \text{ (good)}\}$
is the state of channel $n$ in slot $t$. At the beginning of each slot, the user first decides which
$M$ channels to sense for potential access. Once a channel (say channel $n$) is chosen, the user
detects the channel state, which can be considered as a binary hypothesis test$^1$:

$$H_0 : S_n(t) = 1 \text{ (good)} \quad \text{vs.} \quad H_1 : S_n(t) = 0 \text{ (bad)}.$$

The performance of channel state detection is characterized by the receiver operating
characteristic (ROC), which relates the probability of false alarm $\epsilon$ and the probability of
miss detection $\delta$:

$$\epsilon = \Pr\{\text{decide } H_1 \mid H_0 \text{ is true}\}, \qquad \delta = \Pr\{\text{decide } H_0 \mid H_1 \text{ is true}\}.$$


$^1$ We consider here the nontrivial cases with $p_{01}$ and $p_{11}$ in the open interval $(0,1)$. When they take the special value of 0
or 1, channel state detection can be simplified. Extensions to such special cases are straightforward.
Based on the imperfect detection outcome in slot $t$, the user chooses an access action $\Phi_n(t) \in
\{0 \text{ (no access)}, 1 \text{ (access)}\}$ that determines whether to access channel $n$ for transmission. We note
that the design should be subject to a constraint on the probability of accessing a bad channel,
which may cause interference or waste energy. Specifically, the probability of collision $P_n(t)$
perceived by the primary network in any channel and slot is capped at a predetermined
threshold $\zeta$, i.e.,

$$P_n(t) = \Pr\{\Phi_n(t) = 1 \mid S_n(t) = 0\} \le \zeta, \quad \forall\, n, t.$$

This constrained stochastic optimization problem requires the joint design of the channel state
detector (i.e., how to choose the detection thresholds to trade off false alarms with miss detections),
the access policy that decides the transmission probabilities based on imperfect detection
outcomes, and the sensing policy for channel selection. This problem is formulated as a constrained
POMDP in [8] for generally correlated channels. A separation principle has been established:
the optimal detector is the Neyman-Pearson detector with the probability $\delta$ of miss detection
given by the maximum allowable probability $\zeta$ of collision, and the optimal access policy is to
simply trust the detection outcomes, i.e., transmit over a channel if and only if it is detected as good.
Thus, the user can obtain a unit reward on a chosen channel if and only if it is in the good state
and detected correctly (i.e., no false alarm). The optimal sensing policy can then be designed
using the optimal detector and the optimal access policy without the constraint on accessing
a bad channel, which yields the unconstrained POMDP addressed here. The objective is to
maximize the average reward (throughput) over a horizon of $T$ slots by judiciously choosing a
sensing policy that governs channel selection in each slot.

Since  failed  transmission  may  occur,  acknowledgements  (ACKs)  are  necessary  to  ensure 
guaranteed  delivery.  Specifically,  when  the  receiver  successfully  receives  a  packet  from  a  channel, 
it  sends  an  acknowledgement  to  the  transmitter  over  the  same  channel  at  the  end  of  the  slot. 
Otherwise,  the  receiver  does  nothing,  i.e.,  a  NAK  is  defined  as  the  absence  of  an  ACK,  which 
occurs  when  the  transmitter  did  not  transmit  over  this  channel  or  transmitted  but  the  channel  is  in 
bad  state.  We  assume  that  acknowledgements  are  received  without  error  since  acknowledgements 
are  always  transmitted  over  good/idle  channels. 
B. Restless Multi-Armed Bandit Formulation

Due to limited and imperfect sensing, the system state $[S_1(t), \cdots, S_N(t)] \in \{0,1\}^N$ in slot
$t$ is not fully observable to the user. The user can, however, infer the state from its decision and
observation history. It has been shown that a sufficient statistic of the system for optimal
decision making is given by the conditional probability that each channel is in state 1 given
all past decisions and observations [10]. Referred to as the belief vector, this sufficient
statistic is denoted by $\Omega(t) = [\omega_1(t), \cdots, \omega_N(t)]$, where $\omega_i(t)$ is the conditional probability that
$S_i(t) = 1$. In order to ensure that the user and its intended receiver tune to the same channels in
each slot, channel selections should be based on common observations: the acknowledgements
$\mathcal{K}(t) \in \{0 \text{ (NAK)}, 1 \text{ (ACK)}\}^M$ in each slot rather than the detection outcomes at the transmitter.
Let $I(t)$ denote the sensing action that consists of the $M$ channels to sense in slot $t$. Given the sensing
action $I(t)$ and the observations $\{K_i(t) \in \{0,1\} : i \in I(t)\}$ in slot $t$, the belief vector for slot
$t+1$ can be obtained via Bayes' rule:

$$\omega_i(t+1) = \begin{cases} p_{11}, & i \in I(t),\ K_i(t) = 1 \\ \mathcal{T}\!\left(\dfrac{\epsilon\,\omega_i(t)}{\epsilon\,\omega_i(t) + 1 - \omega_i(t)}\right), & i \in I(t),\ K_i(t) = 0 \\ \mathcal{T}(\omega_i(t)), & i \notin I(t) \end{cases} \tag{1}$$

where the operator $\mathcal{T}(\cdot)$ is defined as

$$\mathcal{T}(x) = x\,p_{11} + (1 - x)\,p_{01}.$$
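As an illustration, the belief update (1) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function and argument names (`transition`, `update_belief`, `sensed`, `acks`) are ours and do not appear in the original formulation, and beliefs are assumed to lie strictly inside $(0,1)$ so that the posterior is well defined.

```python
def transition(x, p11, p01):
    """One-step propagation T(x) = x*p11 + (1 - x)*p01 of a belief value."""
    return x * p11 + (1.0 - x) * p01


def update_belief(omega, sensed, acks, p11, p01, eps):
    """Return the belief vector for slot t+1 following the three cases of (1).

    omega  : list of beliefs omega_i(t), each assumed to be in (0, 1)
    sensed : set of channel indices sensed in slot t (hypothetical encoding)
    acks   : dict mapping each sensed index to 1 (ACK) or 0 (NAK)
    eps    : false alarm probability of the detector
    """
    new_omega = []
    for i, w in enumerate(omega):
        if i in sensed and acks[i] == 1:
            # ACK: the channel was certainly good in slot t.
            new_omega.append(p11)
        elif i in sensed:
            # NAK: posterior of "good" given a false alarm or a bad channel.
            posterior = eps * w / (eps * w + 1.0 - w)
            new_omega.append(transition(posterior, p11, p01))
        else:
            # Unobserved: propagate the belief through the Markov chain.
            new_omega.append(transition(w, p11, p01))
    return new_omega
```

For example, with $p_{11} = 0.8$, $p_{01} = 0.2$, $\epsilon = 0.05$ and all beliefs equal to 0.5, a NAK lowers the belief of the sensed channel well below that of an unobserved channel, which stays at 0.5.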

A sensing policy $\pi$ specifies a sequence of functions $\pi = [\pi_1, \pi_2, \cdots, \pi_T]$, where $\pi_t$ maps a
belief vector $\Omega(t)$ to a sensing action $I(t)$ for slot $t$. Multi-channel opportunistic access can thus
be formulated as the following stochastic optimization problem:

$$\pi^* = \arg\max_{\pi}\, \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} R(\pi_t(\Omega(t))) \,\Big|\, \Omega(1)\right],$$

where $R(\pi_t(\Omega(t)))$ is the reward obtained when the belief is $\Omega(t)$ and channels $\pi_t(\Omega(t))$ are
selected, and $\Omega(1)$ is the initial belief vector. This problem falls into the model of an RMBP by
treating the belief value of each channel as the state of each arm of a bandit. If no information
on the initial system state is available, each entry of $\Omega(1)$ can be set to the stationary distribution
$\omega_o$ of the underlying Markov chain:

$$\omega_o = \frac{p_{01}}{p_{01} + p_{10}}. \tag{2}$$
Let $V_t(\Omega)$ be the value function, which represents the maximum expected total reward that
can be obtained starting from slot $t$ given the current belief vector $\Omega$. Given that the user takes
action $I$ and observes $\mathcal{K}$, the expected reward that can be accumulated starting from
slot $t$ consists of two parts: the expected immediate reward $\sum_{i \in I} \omega_i (1 - \epsilon)$ and the maximum
expected future reward $V_{t+1}(\mathcal{T}(\Omega \mid I, \mathcal{K}))$, where $\mathcal{T}(\Omega \mid I, \mathcal{K})$ denotes the updated belief vector for
slot $t+1$ after incorporating action $I$ and observations $\mathcal{K}$ as given in (1). Averaging over all
possible observations $\mathcal{K}$ and maximizing over all actions $I$, we arrive at the following optimality
equations.

$$V_T(\Omega(T)) = \max_{I} \sum_{i \in I} \omega_i(T)(1 - \epsilon),$$

$$V_t(\Omega(t)) = \max_{I} \left\{ \sum_{i \in I} \omega_i(t)(1 - \epsilon) + \mathbb{E}\!\left[ V_{t+1}(\mathcal{T}(\Omega(t) \mid I, \mathcal{K})) \right] \right\}.$$

In theory, the optimal policy $\pi^*$ and its performance $V_1(\Omega(1))$ can be obtained by solving
the above dynamic program. Unfortunately, due to the impact of the current action on the
future reward and the uncountable space of the belief vector, obtaining the optimal solution
directly from the above recursive equations is computationally prohibitive. Even when approximate
numerical solutions can be obtained, they do not provide insight into system design or analytical
characterizations of the optimal performance $V_1(\Omega(1))$.

III.  Structure  and  Performance  of  The  Myopic  Policy 
A.  Myopic  Policy 

A myopic policy ignores the impact of the current action on the future reward, focusing solely
on maximizing the expected immediate reward $\mathbb{E}[R(I(t))]$. Myopic policies are thus stationary.
The myopic action $I$ under belief state $\Omega = [\omega_1, \cdots, \omega_N]$ is simply given by

$$I(\Omega) = \arg\max_{I} \sum_{i \in I} \omega_i. \tag{3}$$
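Since the objective in (3) is a sum of individual beliefs, the maximizing set consists of the $M$ channels with the largest belief values. A minimal Python sketch follows; the function name `myopic_action` and the tie-breaking rule (lower index first) are our assumptions, as the text does not specify how ties are resolved.

```python
def myopic_action(omega, m):
    """Return the indices of the m channels with the largest beliefs.

    Python's sort is stable, also with reverse=True, so channels with equal
    beliefs keep their index order (an assumed tie-breaking rule).
    """
    ranked = sorted(range(len(omega)), key=lambda i: omega[i], reverse=True)
    return ranked[:m]
```

For instance, `myopic_action([0.3, 0.7, 0.5, 0.7], 2)` selects channels 1 and 3.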

In general, obtaining the myopic action in each slot requires the recursive update of the belief
vector $\Omega$ as given in (1), which requires the knowledge of the transition probabilities $\{p_{ij}\}$.
Interestingly, it has been shown in [9] under single-channel sensing ($M = 1$) that the myopic
policy has a simple structure that does not need the update of the belief vector or the precise
knowledge of the transition probabilities if the probability of false alarm is below a certain value.
Surprisingly, the myopic policy with such a simple and robust structure achieves the optimal
performance for $N = 2$ [9]. Under multi-channel sensing ($M > 1$), extensive simulations have
shown that the myopic policy achieves the optimal performance. We thus conjecture that the
optimality of the myopic policy holds for general $M$ and $N$. In the next section, we show that
the structure of the myopic policy can be directly generalized to multi-channel sensing scenarios.
Based on this structure, we characterize the performance of the myopic policy.

B.  Structure 

We  first  present  the  following  assumptions. 

A1: The initial belief values are bounded between $p_{01}$ and $p_{11}$.

A2: The false alarm probability satisfies

$$\epsilon \le \frac{\min\{p_{01}, p_{11}\}\,(1 - \max\{p_{01}, p_{11}\})}{\max\{p_{01}, p_{11}\}\,(1 - \min\{p_{01}, p_{11}\})}.$$

Assumption A1 will only be used in Theorem 1, which describes the structure of the myopic
policy. We note that the structure can be directly extended if assumption A1 does not hold. We
assume A1 in Theorem 1 for ease of presentation.

For Assumption A2, the allowed probability of miss detection $\delta$ plays a major role since $\epsilon$
can be reduced to an arbitrarily small value at the price of increased $\delta$. However, both $\epsilon$ and $\delta$
can be improved by increasing the sensing/detection time (i.e., taking more measurements). The
caveat is the reduced transmission time for a given slot length. This tradeoff between
the complexity of the detector at the physical layer and the transmission strategy at the Medium
Access Control (MAC) layer of a communication network can be complex and is beyond the
scope of this paper.
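To make assumption A2 concrete, the largest admissible false alarm probability can be computed directly from the two transition probabilities. The following Python sketch uses helper names of our own choosing (`false_alarm_threshold`, `satisfies_a2`) and assumes $p_{01}$ and $p_{11}$ lie in the open interval $(0,1)$.

```python
def false_alarm_threshold(p11, p01):
    """Right-hand side of assumption A2 for transition probabilities in (0, 1)."""
    lo, hi = min(p01, p11), max(p01, p11)
    return lo * (1.0 - hi) / (hi * (1.0 - lo))


def satisfies_a2(eps, p11, p01):
    """Check whether the false alarm probability eps meets assumption A2."""
    return eps <= false_alarm_threshold(p11, p01)
```

For $p_{11} = 0.8$ and $p_{01} = 0.2$ the threshold evaluates to $0.0625$, and the same value is obtained when the two probabilities are swapped, since A2 is symmetric in $p_{01}$ and $p_{11}$.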

The  implementation  of  the  myopic  policy  can  be  described  with  a  queue  structure.  Specifically, 
all  N  channels  are  ordered  in  a  queue,  and  in  each  slot,  those  M  channels  at  the  head  of  the 
queue  are  sensed. 

Theorem 1: The Semi-Universal Structure of the Myopic Policy
The initial channel ordering $\mathcal{Q}(1)$ is determined by the initial belief vector as given below.

$$\omega_{n_1}(1) \ge \omega_{n_2}(1) \ge \cdots \ge \omega_{n_N}(1) \implies \mathcal{Q}(1) = (n_1, n_2, \cdots, n_N).$$
Under assumptions A1 and A2, channels are reordered at the end of each slot according to the
following simple rules. When $p_{11} > p_{01}$, the channels observed with ACK will stay at the head
of the queue, and the channels observed with NAK will be moved to the end of the queue while
keeping their order unchanged. When $p_{11} < p_{01}$, the channels observed with NAK will stay at
the head of the queue while reversing their order, and the channels observed with ACK will be
moved to the end of the queue. The order of the unobserved channels is also reversed.

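Proof: Let $\mathcal{Q}(t) = (n_1, n_2, \cdots, n_N)$ ($n_i \in \{1, 2, \cdots, N\}\ \forall i$) be the queueing order of
channels in slot $t$. We need to show that

$$\omega_{n_1}(t) \ge \cdots \ge \omega_{n_N}(t). \tag{4}$$

We first present the following properties of the operator $\mathcal{T}(x)$ defined in (1).

P1. $\mathcal{T}(x)$ is an increasing function for $p_{11} > p_{01}$ and a decreasing function for $p_{11} < p_{01}$.

P2. $\forall\, 0 \le x \le 1$, $p_{01} \le \mathcal{T}(x) \le p_{11}$ for $p_{11} > p_{01}$ and $p_{11} \le \mathcal{T}(x) \le p_{01}$ for $p_{11} < p_{01}$.

P3. For $p_{11} > p_{01}$ and $\epsilon \le \frac{p_{10} p_{01}}{p_{11} p_{00}}$, we have $\mathcal{T}\!\left(\frac{\epsilon\omega}{\epsilon\omega + (1 - \omega)}\right) \le \mathcal{T}(\omega')$ $\forall\, p_{01} \le \omega, \omega' \le p_{11}$; for
$p_{11} < p_{01}$ and $\epsilon \le \frac{p_{11} p_{00}}{p_{01} p_{10}}$, we have $\mathcal{T}\!\left(\frac{\epsilon\omega}{\epsilon\omega + (1 - \omega)}\right) \ge \mathcal{T}(\omega')$ $\forall\, p_{11} \le \omega, \omega' \le p_{01}$.

P1 and P2 follow directly from the definition of $\mathcal{T}(x)$. To show P3 for $p_{11} > p_{01}$, it suffices to
show $\frac{\epsilon\omega}{\epsilon\omega + (1 - \omega)} \le p_{01}$ due to the monotonically increasing property of $\mathcal{T}(x)$ and the bound on
$\omega'$. Noticing that $\frac{\epsilon\omega}{\epsilon\omega + (1 - \omega)}$ is an increasing function of both $\omega$ and $\epsilon$, we arrive at P3 by using
the upper bounds on $\omega$ and $\epsilon$. Similarly, we can show P3 for $p_{11} < p_{01}$.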

We now prove (4) by induction. For $t = 1$, (4) holds by the definition of $\mathcal{Q}(1)$. Assume
that (4) is true for slot $t$. We show that it is also true for slot $t+1$.

Consider first $p_{11} > p_{01}$. For any $1 \le i \le M$ with $K_{n_i}(t) = 1$, $\omega_{n_i}(t+1) = p_{11}$, which achieves
the upper bound of the belief values (see P2). For any $1 \le j \le M$ with $K_{n_j}(t) = 0$, $\omega_{n_j}(t+1)$ is
upper bounded by those of the unobserved channels due to P3. Among the channels with observation 0,
the order of their beliefs remains unchanged in slot $t+1$ due to P1. Similarly, the order of the
belief values of the unobserved channels also remains unchanged in slot $t+1$.

For $p_{11} < p_{01}$, the belief values of the channels with observation 1 will achieve the lower bound $p_{11}$ of
the belief values (see P2). For any $1 \le j \le M$ with $K_{n_j}(t) = 0$, $\omega_{n_j}(t+1)$ is lower bounded by
those of the unobserved channels due to P3. Among the channels with observation 0, the order of their
beliefs will be reversed in slot $t+1$ due to P1. Similarly, the order of the belief values of the
unobserved channels will also be reversed in slot $t+1$.

We have thus proved (4) for all $t \ge 1$ under the structure of the myopic policy. ■
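The reordering rules of Theorem 1 can be sketched as a small Python function operating on the channel queue. This is only an illustrative sketch: the name `reorder_queue` and the list-based encoding are ours, and the relative order among channels observed with ACK (which all share the belief $p_{11}$) is kept unchanged as an assumption, since the theorem does not prescribe it.

```python
def reorder_queue(queue, acks, positively_correlated):
    """Reorder the channel queue at the end of a slot per Theorem 1.

    queue : list of channel ids, head of the queue first
    acks  : observations (1 = ACK, 0 = NAK) for the first M = len(acks) channels
    positively_correlated : True for p11 > p01, False for p11 < p01
    """
    m = len(acks)
    sensed, unobserved = queue[:m], queue[m:]
    ack = [ch for ch, k in zip(sensed, acks) if k == 1]
    nak = [ch for ch, k in zip(sensed, acks) if k == 0]
    if positively_correlated:
        # ACK channels stay at the head; NAK channels go to the end, order kept.
        return ack + unobserved + nak
    # NAK channels stay at the head in reversed order, unobserved channels are
    # reversed, and ACK channels move to the end of the queue.
    return nak[::-1] + unobserved[::-1] + ack
```

Note that the function takes only the sign of $p_{11} - p_{01}$ as input, which reflects the semi-universal nature of the structure: no transition probabilities or belief values are needed.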

Based on this structure, the myopic policy can be implemented without knowing the channel
transition probabilities except the order of $p_{11}$ and $p_{01}$. As a result, the myopic policy is robust
against model mismatch and automatically tracks variations in the channel model provided that
the order of $p_{11}$ and $p_{01}$ remains unchanged. Following the belief-independence property of this
simple structure, we present the following corollary which allows us to work with a Markov
reward process with a finite state space instead of one with an uncountable state space (i.e.,
belief vectors) as we encounter in a general POMDP.

Corollary 1: Let $\mathcal{Q}(t) = (n_1, n_2, \cdots, n_N)$ ($n_i \in \{1, 2, \cdots, N\}\ \forall i$) be the queueing order of
channels in slot $t$, where the myopic action is $I(t) = \{n_i\}_{i=1}^{M}$. Define $\mathbf{S}(t) = [S_{n_1}(t), S_{n_2}(t), \cdots, S_{n_N}(t)]$
and $\mathbf{E}(t) = \{e_1(t), e_2(t), \cdots, e_M(t)\}$, where $\{e_j(t)\}_{1 \le j \le M,\ t \ge 1}$ are i.i.d. binary random variables
taking value 0 with probability $\epsilon$ and 1 with probability $1 - \epsilon$. Under assumption A2, the
augmented Markov process $\mathbf{G}(t) = [\mathbf{S}(t), \mathbf{E}(t)]$ forms a $2^{N+M}$-state Markov chain, and the
performance of the myopic policy is determined by the Markov reward process $(\mathbf{G}(t), R(t))$ with
$R(t) = \sum_{i=1}^{M} S_{n_i}(t)\, e_i(t)$.

Proof: $\mathbf{G}(t)$ specifies the states of all channels, the queueing order of channels under the
myopic policy, and the observations obtained in slot $t$. Specifically, the observation (0 (NAK), 1 (ACK))
on channel $n_i$ ($1 \le i \le M$) in slot $t$ is given by $S_{n_i}(t)\, e_i(t)$. Based on the structure of the myopic
policy, $\mathbf{G}(t)$ determines the probability distribution of $\mathbf{G}(t+1)$, i.e., $\mathbf{G}(t)$ is a Markov chain.
Furthermore, the reward $R(t)$ in slot $t$ is given by the number of channels observed with ACK.

■ 

Theorem 1 and Corollary 1 provide the foundation for analyzing the performance of the myopic
policy.

C.  Performance 

In  this  section,  we  analyze  the  performance  of  the  myopic  policy.  Under  the  optimality 
conjecture  (see  Sec.  III-A),  the  throughput  achieved  by  the  myopic  policy  defines  the  performance 
limit  of  a  multi-channel  opportunistic  communications  system.  In  particular,  we  are  interested 
in  the  relationship  between  the  throughput  achieved  by  the  myopic  policy  and  the  number  N  of 
channels. 


TECHNICAL  REPORT  TR-09-01,  UC  DAVIS,  MARCH,  2009. 




1)  Uniqueness  of  Steady-State  Performance  and  Its  Numerical  Evaluation:  We  first  establish 
the  existence  and  uniqueness  of  the  system  steady-state  performance  under  the  myopic  policy. 
The steady-state throughput under the myopic policy is given by

$$U(\Omega(1)) = \lim_{T \to \infty} \frac{V_{1:T}(\Omega(1))}{T}, \tag{5}$$

where $V_{1:T}(\Omega(1))$ is the expected total reward obtained in $T$ slots under the myopic policy
when the initial belief is $\Omega(1)$. From Corollary 1, $U(\Omega(1))$ is determined by the Markov reward
process $\{\mathbf{G}(t), R(t)\}$. It is easy to see that the $2^{N+M}$-state Markov chain $\{\mathbf{G}(t)\}$ is irreducible
and aperiodic, and thus has a limiting distribution. As a consequence, the limit in (5) exists, and the
steady-state throughput $U$ is independent of the initial belief value $\Omega(1)$.

Corollary 1 also provides a numerical approach to evaluating $U$ by calculating the limiting
(stationary) distribution of $\mathbf{G}(t)$, whose transition probabilities can be directly obtained from the
transition probabilities of the channel states. This numerical approach, however, does not provide
an analytical characterization of the throughput $U$ in terms of the number $N$ of channels and
the transition probabilities $\{p_{ij}\}$. In the next section, we obtain analytical expressions of $U$ and
its scaling behavior with respect to $N$ based on a stochastic dominance argument.
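As a complement to the stationary-distribution approach, the throughput can also be estimated by direct Monte Carlo simulation. The sketch below is not the method described above: it simulates the channels, tracks beliefs with the update (1), and selects the $M$ largest beliefs in each slot. The function name `simulate_myopic`, the seeding, and the initialization of all beliefs to the stationary distribution in (2) are our assumptions.

```python
import random


def simulate_myopic(n, m, p11, p01, eps, slots, seed=0):
    """Estimate the per-slot reward of the belief-based myopic policy."""
    rng = random.Random(seed)
    omega_o = p01 / (p01 + (1.0 - p11))  # stationary probability of "good"
    states = [1 if rng.random() < omega_o else 0 for _ in range(n)]
    belief = [omega_o] * n
    reward = 0
    for _ in range(slots):
        ranked = sorted(range(n), key=lambda i: belief[i], reverse=True)
        sensed = set(ranked[:m])
        next_belief = []
        for i in range(n):
            w = belief[i]
            if i in sensed:
                # ACK requires a good channel and no false alarm.
                ack = states[i] == 1 and rng.random() >= eps
                if ack:
                    reward += 1
                    posterior = 1.0
                else:
                    posterior = eps * w / (eps * w + 1.0 - w)
            else:
                posterior = w
            next_belief.append(posterior * p11 + (1.0 - posterior) * p01)
        belief = next_belief
        # Each channel evolves independently as a two-state Markov chain.
        states = [1 if rng.random() < (p11 if s == 1 else p01) else 0
                  for s in states]
    return reward / slots
```

With $N = 3$, $M = 1$, $p_{11} = 0.8$, $p_{01} = 0.2$ and $\epsilon = 0.05$ (which satisfies A2), the estimate should exceed the throughput $\omega_o(1-\epsilon) = 0.475$ of a user that picks channels blindly, up to simulation noise.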

2) Analytical Characterization of Throughput: From the structure of the myopic policy, the
throughput is determined by how often the user switches channels. When $p_{11} > p_{01}$, a channel
switch is equivalent to a slot without reward. The opposite holds when $p_{11} < p_{01}$: a
channel switch corresponds to a slot with reward. In both cases, we note that the user may
switch to the same channel when a channel switch is needed.

We thus introduce the concept of transmission period (TP), which is the time period starting
from the slot in which the user switches to a channel and ending at the slot in which the next switch
on this channel is needed (see Fig. 2 for an example under single-channel sensing). Note that the user
may switch to the same channel. We count the transmission periods in the order of their starting
points. Let $L_k$ denote the length of the $k$-th TP. We then have a discrete-time random process
$\{L_k\}_{k=1}^{\infty}$ with a state space of positive integers.

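Lemma 1:

$$U = \begin{cases} M(1 - 1/\bar{L}), & p_{11} > p_{01} \\ M/\bar{L}, & p_{11} < p_{01} \end{cases} \tag{6}$$

where $\bar{L} = \lim_{K \to \infty} \frac{\sum_{k=1}^{K} L_k}{K}$ denotes the average length of a TP.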
Fig. 2. The transmission period structure.


Proof: Consider first $p_{11} > p_{01}$. Let $k_T$ denote the number of channel switches during a
finite horizon of length $T$. Since a channel switch represents a loss of a unit of reward, the
throughput $U_T$ during the finite horizon is given below.

$$U_T = \frac{MT - k_T}{T}. \tag{7}$$


Let $j_T$ denote the number of TPs during the finite horizon. We have $j_T = M + k_T$ since a
channel switch initializes a new TP. It is easy to see that $MT \le \sum_{k=1}^{M + k_T} L_k \le MT + L_k$.


Note that the length of a TP is finite almost surely. We thus have

$$\lim_{T \to \infty} \frac{MT}{k_T} = \lim_{T \to \infty} \frac{\sum_{k=1}^{M + k_T} L_k}{M + k_T} = \bar{L} \quad (a.s.) \tag{8}$$

From (7) and (8), we have

$$U = \lim_{T \to \infty} U_T = M \lim_{T \to \infty} \left(1 - \frac{k_T}{MT}\right) = M(1 - 1/\bar{L}) \quad (a.s.) \tag{9}$$

The case for $p_{11} < p_{01}$ can be similarly obtained by observing that a channel switch represents
a gain of one unit of reward. ■
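Lemma 1 translates an average TP length into a throughput value. The Python sketch below applies (6) to an empirical sample of TP lengths, using the sample mean in place of the limit $\bar{L}$; the function name `throughput_from_tp_lengths` is ours, and the finite-sample average is only an approximation of the limiting quantity in the lemma.

```python
def throughput_from_tp_lengths(lengths, m, positively_correlated):
    """Apply (6) with the sample mean of TP lengths standing in for L-bar.

    lengths : non-empty list of observed TP lengths (positive integers)
    m       : number M of channels sensed per slot
    """
    avg_len = sum(lengths) / len(lengths)
    if positively_correlated:
        # p11 > p01: every channel switch costs one unit of reward.
        return m * (1.0 - 1.0 / avg_len)
    # p11 < p01: every channel switch yields one unit of reward.
    return m / avg_len
```

For example, TP lengths `[1, 2, 3, 4, 2]` have mean 2.4, which gives a throughput of about 0.583 for $p_{11} > p_{01}$ and about 0.417 for $p_{11} < p_{01}$ when $M = 1$.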

Based on Lemma 1, throughput analysis is reduced to analyzing the average TP length $\bar{L}$.
We note that the distribution of $L_k$ is determined by the belief value in the first slot of the
$k$-th TP. Under single-channel sensing ($M = 1$), the approach is to construct first-order Markov
chains that stochastically dominate or are dominated by $\{L_k\}_{k=1}^{\infty}$. The stationary distributions of
these first-order Markov chains, which can be obtained in closed form, lead to lower and upper
bounds on $U$ according to (6). Specifically, for $p_{11} > p_{01}$, a lower bound on $U$ is obtained by
constructing a first-order Markov chain whose stationary distribution is stochastically dominated
by the stationary distribution of $\{L_k\}_{k=1}^{\infty}$. An upper bound on $U$ is given by a first-order
Markov chain whose stationary distribution stochastically dominates the stationary distribution
of $\{L_k\}_{k=1}^{\infty}$. Similarly, bounds on $U$ can be obtained for $p_{11} < p_{01}$.
Theorem 2: Define functions


f(x)± _ u°~x _ 

1  _  rfl  _  A(~\  —  (I’ii-Vm)'.  I-— /Ml (  i-c)) 

i  e/li  l-(pu-poi)Pll(l-«)  ■ 


,  , .  a  1  —  cc0(l  —  e)  +  a 

hix,  V,  z,  a,  b )-  (W(-,  -pul)2+(pil-pul)t+i),  ~ > 

'  1-((P11-P0l)2/)2  ' 

and  for  any  function  v(-)  of  vector  [x,  y,  z,  a,  b]. 


g  o  v(x,  y,  z,  a,  b)=~w^ - — - (10) 

KjyW  -x)v{x,y,z,a,  b)  +  1 

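Under assumption A2, we have the following lower and upper bounds on the throughput $U$ when
$M = 1$.

Case 1: $p_{11} > p_{01}$

$$\frac{f(c_1)(1 - \epsilon)}{1 - (p_{11} - f(c_1))(1 - \epsilon)} \le U \le \frac{\omega_o(1 - \epsilon)}{1 - (p_{11} - \omega_o)(1 - \epsilon)}, \tag{11}$$

where $\omega_o$ is given by (2) and

$$c_1 = (\omega_o - c_2)(p_{11} - p_{01})^{N-1}, \qquad c_2 = \frac{p_{01}(1 - p_{01} + \epsilon\, p_{11})}{1 - p_{01} + \epsilon\, p_{01}}.$$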


Case  2:  pu  <  p01 


where 


9  °  h(x  1,2/1,  Zi,  ai,  27V  -  4)  <  <  g  o  h(  —  ,  1  -  zi,  1  -  yu  au  3), 


—  - 

Pll(Pll  -Poi)  +Poi’ 

$y_1 = 1 - (1 - \epsilon)(p_{11}(p_{11} - p_{01}) + p_{01})$,

$z_1 = (1 - \epsilon)\, p_{01}$,

$a_1 = (1 - \epsilon)(\omega_o - p_{11})(p_{11} - p_{01})$.


Proof: 

Case 1: $p_{11} > p_{01}$
Let $\omega_k$ denote the belief value of the chosen channel in the first slot of the $k$-th TP. The length
$L_k(\omega_k)$ of this TP has the following distribution.

$$\Pr[L_k(\omega_k) = l] = \begin{cases} 1 - \omega_k(1 - \epsilon), & l = 1 \\ \omega_k (1 - \epsilon)^{l-1}\, p_{11}^{\,l-2}\, (1 - p_{11}(1 - \epsilon)), & l > 1 \end{cases} \tag{13}$$

It is easy to see that if $\omega' > \omega$, then $L_k(\omega')$ stochastically dominates $L_k(\omega)$.


(a)  pu  >  poi  (b)  pu  <  p01 

Fig.  3.  The  j-step  belief  update  when  unobserved. 


Note  that  the  j-step  belief  update  rj(u;)  when  unobserved  is  given  by  (See  Fig.  3) 

rj (cc)  =uj0-  (u0  —  u>) (pn  -PoiY- 

Based  on  the  structure  of  the  myopic  policy,  we  have  04  =  r^Jfc+1^(£x^_j),  where  Jk  = 
Lk_i  denotes  the  number  of  consecutive  slots  in  which  the  chosen  channel  has  been 
unobserved  since  the  last  visit,  and  x  denotes  the  belief  value  of  the  chosen  channel  at  the  last 
time  the  user  left  it.  From  assumption  A2,  r  ( ex^_x )  <  r(p0i)  <  0; 0 ,  where  u0  is  the  stationary 
distribution  of  the  Gilbert-Elliot  channel  given  in  (2).  Based  on  the  monotonic  convergence 
property  of  the  j-step  belief  update  (see  Fig.  3  (a)),  we  have  uk  <  luq.  Lk{uj0)  thus  stochastically 
dominates  Lk{ujk),  and  the  expectation  of  the  former,  Lk(uj0)  =  1  +  ,  leads  to  the  upper 

bound  of  U  given  in  (11). 
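As a quick numerical check of this step, the following sketch evaluates the TP length distribution (13), compares its mean with the closed form $1 + \omega(1-\epsilon)/(1 - p_{11}(1-\epsilon))$, and converts the mean into a throughput value. The parameter values and function names are illustrative only. The conversion $U = 1 - 1/\bar{L}$ is an assumption about the form of (6) for $p_{11} > p_{01}$ (each TP ends with exactly one unsuccessful slot); under that assumption the result coincides with the right-hand side of (11).

```python
def belief_update(omega, j, p11, p01):
    """Closed-form j-step belief update T^j(omega) for an unobserved channel."""
    omega_o = p01 / (1.0 - p11 + p01)  # stationary probability of state 1
    return omega_o - (omega_o - omega) * (p11 - p01) ** j


def tp_length_pmf(omega, p11, eps, l_max):
    """Pr[L = l] for l = 1..l_max as in (13), given initial belief omega."""
    q = p11 * (1.0 - eps)  # probability of staying after a successful slot
    pmf = [1.0 - omega * (1.0 - eps)]
    pmf += [omega * (1.0 - eps) * q ** (l - 2) * (1.0 - q)
            for l in range(2, l_max + 1)]
    return pmf


def mean_tp_length(omega, p11, eps):
    """Closed-form mean of the distribution in (13)."""
    return 1.0 + omega * (1.0 - eps) / (1.0 - p11 * (1.0 - eps))


# Illustrative parameters (not taken from the paper's experiments).
p11, p01, eps = 0.8, 0.2, 0.03
omega_o = p01 / (1.0 - p11 + p01)

pmf = tp_length_pmf(omega_o, p11, eps, l_max=400)
numeric_mean = sum(l * p for l, p in enumerate(pmf, start=1))
closed_mean = mean_tp_length(omega_o, p11, eps)

# Assumed form of (6) for p11 > p01: U = 1 - 1 / (mean TP length).
upper_bound = 1.0 - 1.0 / closed_mean
```

Truncating the distribution at 400 slots is harmless here because the tail decays geometrically with ratio $p_{11}(1-\epsilon)$.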

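Next, we prove the lower bound of $U$ by constructing a hypothetical system where the initial belief value of the chosen channel in a TP is a lower bound of that in the real system. The average TP length in this hypothetical system is thus smaller than that in the real system, leading to a lower bound on $U$ based on (6). Specifically, since $\omega_k = T^{J_k+1}\!\left(\frac{\epsilon x}{\epsilon x + 1 - x}\right)$ and $J_k = \sum_{i=1}^{N-1} L_{k-i} \ge N + L_{k-1} - 2$, we have

$$\omega_k \ge T^{J_k+1}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \ge T^{N + L_{k-1} - 1}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right)$$

based on the monotonic increasing property of the $j$-step belief update (see Fig. 3 (a)). We thus construct a hypothetical system given by a first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition matrix $R = \{r_{i,j}\}$.

$$r_{i,j} = \begin{cases} 1 - T^{N+i-1}\!\left(\dfrac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right)(1-\epsilon), & i \ge 1,\ j = 1 \\[2mm] T^{N+i-1}\!\left(\dfrac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right)(1-\epsilon)^{j-1}\, p_{11}^{\,j-2}\,(1 - p_{11}(1-\epsilon)), & i \ge 1,\ j \ge 2 \end{cases} \tag{14}$$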


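Lemma 2: The stationary distribution of the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ is stochastically dominated by the stationary distribution of $\{L_k\}_{k=1}^{\infty}$.

Proof:

Let $\omega'_k$ ($\omega_k$) denote the expected probability that the chosen channel is in state 1 in the first slot of the $k$-th transmission period of $\{L'_k\}$ ($\{L_k\}$). Assume that in the $k$-th transmission period, the distributions of $L'_k$ and $L_k$ both equal the same distribution $\lambda$, which may or may not be the stationary distribution of $\{L_k\}_{k=1}^{\infty}$. Next we show $\omega_{k+n} \ge \omega'_{k+n}$ for any $n \ge 1$ by induction.

When $n = 1$, we have

$$\begin{aligned}
\omega_{k+1} &= \sum_{l=1}^{\infty} E_{L_{k-N+2},\ldots,L_{k-1}}\!\left[ T^{\,1+\sum_{i=1}^{N-1} L_{k+1-i}}\!\left(\frac{\epsilon x}{\epsilon x + 1 - x}\right) \,\Big|\, L_k = l \right] \Pr(L_k = l) \\
&\ge \sum_{l=1}^{\infty} E_{L_{k-N+2},\ldots,L_{k-1}}\!\left[ T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \,\Big|\, L_k = l \right] \Pr(L_k = l) \\
&= \sum_{l=1}^{\infty} T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \Pr(L'_k = l) \\
&= \omega'_{k+1}.
\end{aligned} \tag{15}$$

Assume $\omega_{k+n} \ge \omega'_{k+n}$, then

$$\begin{aligned}
\omega_{k+n+1} &= \sum_{l=1}^{\infty} E\!\left[ T^{\,1+\sum_{i=1}^{N-1} L_{k+n+1-i}}\!\left(\frac{\epsilon x}{\epsilon x + 1 - x}\right) \,\Big|\, L_{k+n} = l \right] \Pr(L_{k+n} = l) \\
&\ge \sum_{l=1}^{\infty} E\!\left[ T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \,\Big|\, L_{k+n} = l \right] \Pr(L_{k+n} = l) \\
&= \sum_{l=1}^{\infty} T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \Pr(L_{k+n} = l).
\end{aligned} \tag{16}$$

Since $\omega_{k+n} \ge \omega'_{k+n}$, by (13), we have

$$\begin{aligned}
\Pr(L_{k+n} = l) &\le \Pr(L'_{k+n} = l), \quad \text{if } l = 1; \\
\Pr(L_{k+n} = l) &\ge \Pr(L'_{k+n} = l), \quad \text{if } l > 1.
\end{aligned} \tag{17}$$

Since the smallest number in the series $\left\{ T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \right\}_{l \ge 1}$ is the first one, by (17) and the fact that $\sum_{l=1}^{\infty} \Pr(L_{k+n} = l) = \sum_{l=1}^{\infty} \Pr(L'_{k+n} = l) = 1$, we have

$$\sum_{l=1}^{\infty} T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \Pr(L_{k+n} = l) \ge \sum_{l=1}^{\infty} T^{\,N-1+l}\!\left(\frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right) \Pr(L'_{k+n} = l) = \omega'_{k+n+1}. \tag{18}$$

Combining (16) and (18), we have $\omega_{k+n+1} \ge \omega'_{k+n+1}$.

By the above induction, we have $\omega_{k+n} \ge \omega'_{k+n}$ for any $n \ge 1$. So the stationary distribution of the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ is dominated by the stationary distribution of $\{L_k\}_{k=1}^{\infty}$.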

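Let $\bar{L}'$ denote the average length of a transmission period of $\{L'_k\}$. Based on (6) and Lemma 2, $\bar{L}'$ leads to a lower bound on $U$. Last, we obtain $\bar{L}'$ in closed form by solving for the stationary distribution of the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$.

Recall that $R = \{r_{i,j}\}$ is the transition matrix of $\{L'_k\}_{k=1}^{\infty}$, where $r_{i,j}$ is given in (14). Let $R(:, k)$ denote the $k$-th column of $R$. We have

$$\mathbf{1} - R(:, 1) = \frac{R(:, 2)}{1 - p_{11}(1-\epsilon)}, \qquad R(:, k) = R(:, 2)\,(p_{11}(1-\epsilon))^{k-2}, \quad (k > 2) \tag{19}$$

where $\mathbf{1}$ is the unit column vector $[1, 1, \ldots]^T$. By the definition of stationary distribution, we have, for $k = 1, 2, \ldots$,

$$[\lambda_1, \lambda_2, \ldots]\, R(:, k) = \lambda_k, \tag{20}$$

which, combined with (19), leads to

$$\lambda_1 = 1 - \frac{\lambda_2}{1 - p_{11}(1-\epsilon)}, \qquad \lambda_k = \lambda_2\,(p_{11}(1-\epsilon))^{k-2}. \quad (k > 2) \tag{21}$$

Substituting (21) into (20) for $k = 2$ and solving for $\lambda_2$, we have $\lambda_2 = f(c_1)(1-\epsilon)(1 - p_{11}(1-\epsilon))$, where $f(c_1)$ is given in (11). From (21), we then have the stationary distribution as

$$\lambda_k = \begin{cases} 1 - f(c_1)(1-\epsilon), & k = 1 \\ f(c_1)(1-\epsilon)\,(p_{11}(1-\epsilon))^{k-2}\,(1 - p_{11}(1-\epsilon)), & k > 1 \end{cases} \tag{22}$$

which leads to $\bar{L}' = \sum_{k=1}^{\infty} k\lambda_k = 1 + \frac{f(c_1)(1-\epsilon)}{1 - p_{11}(1-\epsilon)}$.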
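The closed form above can be cross-checked numerically. The sketch below builds a truncated version of the transition matrix (14), finds its stationary distribution by power iteration, and compares the resulting stationary expected initial belief with $f(c_1)$. The truncation level, the iteration count, the parameter values, and the function names are illustrative assumptions; the expressions for $f$, $c_1$, and $c_2$ follow the reading of Theorem 2 that is consistent with (14) and (22).

```python
def stationary_belief_closed_form(p11, p01, eps, n_channels):
    """f(c1): stationary expected initial belief of the hypothetical chain."""
    delta = p11 - p01
    omega_o = p01 / (1.0 - p11 + p01)
    c2 = p01 * (1.0 - p01 + eps * p11) / (1.0 - p01 + eps * p01)
    c1 = (omega_o - c2) * delta ** (n_channels - 1)
    q = p11 * (1.0 - eps)
    inner = 1.0 - delta * (1.0 - q) / (1.0 - delta * q)
    return (omega_o - c1) / (1.0 - c1 * (1.0 - eps) * inner)


def stationary_belief_numeric(p11, p01, eps, n_channels, k_max=120, sweeps=60):
    """Power iteration on the chain (14), truncated to TP lengths 1..k_max."""
    delta = p11 - p01
    omega_o = p01 / (1.0 - p11 + p01)
    start = eps * p01 / (eps * p01 + 1.0 - p01)
    q = p11 * (1.0 - eps)
    # Initial belief of the next TP when the previous TP had length i.
    omega = [omega_o - (omega_o - start) * delta ** (n_channels + i - 1)
             for i in range(1, k_max + 1)]

    def row(w):
        probs = [1.0 - w * (1.0 - eps)]
        probs += [w * (1.0 - eps) * q ** (j - 2) * (1.0 - q)
                  for j in range(2, k_max + 1)]
        probs[-1] += 1.0 - sum(probs)  # fold the truncated tail into the last state
        return probs

    rows = [row(w) for w in omega]
    lam = [1.0 / k_max] * k_max
    for _ in range(sweeps):
        lam = [sum(lam[i] * rows[i][j] for i in range(k_max))
               for j in range(k_max)]
    return sum(l * w for l, w in zip(lam, omega))


# Illustrative parameters (not taken from the paper's experiments).
p11, p01, eps, n_channels = 0.8, 0.2, 0.03, 5
closed = stationary_belief_closed_form(p11, p01, eps, n_channels)
numeric = stationary_belief_numeric(p11, p01, eps, n_channels)
```

Power iteration is used here only because it needs no linear-algebra library; the chain mixes quickly since all rows of $R$ differ only through the scalar initial belief.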


Case 2: $p_{11} < p_{01}$

Let $\omega_k$ denote the belief value of the chosen channel in the first slot of the $k$-th TP. Define the operator $c(\cdot)$ as $c(x) = \frac{\epsilon x}{\epsilon x + 1 - x}$. We have

$$\Pr[L_k(\omega_k) = l] = \begin{cases} \omega_k(1-\epsilon), & l = 1 \\ (1 - \omega_k(1-\epsilon)) \displaystyle\prod_{i=1}^{l-2} \left(1 - (T \circ c)^{i}(\omega_k)(1-\epsilon)\right) (T \circ c)^{l-1}(\omega_k)(1-\epsilon), & l > 1 \end{cases} \tag{23}$$

Consider first the upper bound. We construct the following hypothetical system where the stationary distribution of a TP is stochastically dominated by the one in the real system. The average TP length in this hypothetical system is thus smaller than that in the real system, leading to an upper bound on $U$ based on (6). Specifically, the distribution of a TP in the hypothetical system has the following form.

$$\Pr[L'_k(\omega_k) = l] = \begin{cases} \dfrac{T(p_{11})}{p_{01}}\,\omega_k(1-\epsilon) + 1 - \dfrac{T(p_{11})}{p_{01}}, & l = 1 \\[2mm] (1 - \omega_k(1-\epsilon))\,(1 - p_{01}(1-\epsilon))^{l-2}\, T(p_{11})(1-\epsilon), & l > 1 \end{cases} \tag{24}$$
We first show that $L'_k(\omega_k)$ is stochastically dominated by $L_k(\omega_k)$. Note that $\Pr[L'_k(\omega_k) = l] \ge 0$ for all $l \in \mathbb{Z}^+$ and $\sum_{l=1}^{\infty} \Pr[L'_k(\omega_k) = l] = 1$. The distribution of $L'_k(\omega_k)$ given in (24) is thus well-defined. Since $T(p_{11}) \le T \circ c(\omega) \le p_{01}$ for any $p_{11} \le \omega \le p_{01}$, we have $\Pr[L'_k(\omega_k) = l] \le \Pr[L_k(\omega_k) = l]$ for all $l \ge 2$. $L'_k(\omega_k)$ is thus stochastically dominated by $L_k(\omega_k)$.

It is easy to see that $L'_k(\omega')$ is stochastically dominated by $L'_k(\omega)$ if $\omega' \ge \omega$. $L'_k(\omega')$ is thus stochastically dominated by $L_k(\omega)$ if $\omega' \ge \omega$. Based on the structure of the myopic policy, it is clear that when $L_{k-1}$ is odd, in the $k$-th TP, the user will switch to the channel visited in the $(k-2)$-th TP. As a consequence, the initial belief $\omega_k$ of the $k$-th TP is given by $\omega_k = T^{L_{k-1}+1}(1)$. When $L_{k-1}$ is even, we can show that $\omega_k \le T^{L_{k-1}+4}(1)$. This is because, for $L_{k-1}$ even, the user cannot switch to a channel visited $L_{k-1} + 2$ slots ago, $T^j(1)$ decreases with $j$ for even $j$, and $T^j(1) \ge T^i(1)$ for any even $j$ and odd $i$ (see Fig. 3 (b)). We thus construct a hypothetical system given by the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition probabilities.

$$r_{i,j} = \begin{cases}
\dfrac{T(p_{11})}{p_{01}}\, T^{i+1}(1)(1-\epsilon) + 1 - \dfrac{T(p_{11})}{p_{01}}, & \text{if } i \text{ is odd},\ j = 1 \\[2mm]
(1 - T^{i+1}(1)(1-\epsilon))\,(1 - p_{01}(1-\epsilon))^{j-2}\, T(p_{11})(1-\epsilon), & \text{if } i \text{ is odd},\ j \ge 2 \\[2mm]
\dfrac{T(p_{11})}{p_{01}}\, T^{i+4}(1)(1-\epsilon) + 1 - \dfrac{T(p_{11})}{p_{01}}, & \text{if } i \text{ is even},\ j = 1 \\[2mm]
(1 - T^{i+4}(1)(1-\epsilon))\,(1 - p_{01}(1-\epsilon))^{j-2}\, T(p_{11})(1-\epsilon), & \text{if } i \text{ is even},\ j \ge 2
\end{cases}$$

Similarly to Lemma 2, it can be shown that the stationary distribution of $\{L'_k\}_{k=1}^{\infty}$ is stochastically dominated by that of $\{L_k\}_{k=1}^{\infty}$. Furthermore, the stationary distribution of $\{L'_k\}_{k=1}^{\infty}$ can be obtained in closed form by using an approach similar to that in Case 1, leading to the upper bound on $U$ given in (12).
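The relation between the real TP length distribution (23) and the hypothetical one (24) can be illustrated numerically. The sketch below evaluates both distributions for one initial belief, so that one can check that (24) sums to one, that it lies below (23) for every length $l \ge 2$, and that its mean is the smaller of the two. The parameter values and function names are illustrative assumptions chosen so that $p_{11} \le \omega \le p_{01}$ holds.

```python
def nak_update(omega, eps):
    """c(omega): belief right after an unsuccessful slot."""
    return eps * omega / (eps * omega + 1.0 - omega)


def one_step(omega, p11, p01):
    """T(omega): one-step belief update of the Gilbert-Elliot channel."""
    return omega * p11 + (1.0 - omega) * p01


def real_pmf(omega, p11, p01, eps, l_max):
    """TP length distribution (23) for l = 1..l_max."""
    pmf = [omega * (1.0 - eps)]
    survive = 1.0 - omega * (1.0 - eps)  # probability that the TP has not ended yet
    belief = omega
    for _ in range(2, l_max + 1):
        belief = one_step(nak_update(belief, eps), p11, p01)
        pmf.append(survive * belief * (1.0 - eps))
        survive *= 1.0 - belief * (1.0 - eps)
    return pmf


def hypothetical_pmf(omega, p11, p01, eps, l_max):
    """TP length distribution (24) for l = 1..l_max."""
    t_p11 = one_step(p11, p11, p01)
    ratio = t_p11 / p01
    pmf = [ratio * omega * (1.0 - eps) + 1.0 - ratio]
    stay = 1.0 - p01 * (1.0 - eps)
    pmf += [(1.0 - omega * (1.0 - eps)) * stay ** (l - 2) * t_p11 * (1.0 - eps)
            for l in range(2, l_max + 1)]
    return pmf


# Illustrative parameters with p11 < p01 (not taken from the paper's experiments).
p11, p01, eps, omega = 0.2, 0.8, 0.03, 0.5
real = real_pmf(omega, p11, p01, eps, l_max=200)
hyp = hypothetical_pmf(omega, p11, p01, eps, l_max=200)

mean_real = sum(l * p for l, p in enumerate(real, start=1))
mean_hyp = sum(l * p for l, p in enumerate(hyp, start=1))
# Closed-form mean of (24): 1 + (1 - omega(1-eps)) T(p11) / (p01^2 (1-eps)).
mean_hyp_closed = 1.0 + (1.0 - omega * (1.0 - eps)) * one_step(p11, p11, p01) / (p01 ** 2 * (1.0 - eps))
```

Because each TP ends with exactly one successful slot when $p_{11} < p_{01}$, a shorter mean TP length corresponds to a larger throughput, which is why the dominated system yields an upper bound.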

We now prove the lower bound. Consider the hypothetical system with the distribution of a TP as given below.

$$\Pr[L'_k(\omega_k) = l] = \begin{cases} \dfrac{p_{01}}{T(p_{11})}\,\omega_k(1-\epsilon) + 1 - \dfrac{p_{01}}{T(p_{11})}, & l = 1 \\[2mm] (1 - \omega_k(1-\epsilon))\,(1 - T(p_{11})(1-\epsilon))^{l-2}\, p_{01}(1-\epsilon), & l > 1 \end{cases} \tag{25}$$

Similarly, $L'_k(\omega_k)$ is well-defined and stochastically dominates $L_k(\omega_k)$. It is easy to see that $L'_k(\omega')$ stochastically dominates $L'_k(\omega)$ if $\omega' \le \omega$. $L'_k(\omega')$ thus stochastically dominates $L_k(\omega)$ if $\omega' \le \omega$.

Based on the structure of the myopic policy, $\omega_k = T^{L_{k-1}+1}(1)$ when $L_{k-1}$ is odd. When $L_{k-1}$ is even, to find a lower bound on $\omega_k$, we need to find the smallest odd $j$ such that the last visit to the channel chosen in the $k$-th TP is $j$ slots ago. From the structure of the myopic policy, the smallest feasible odd $j$ is $L_{k-1} + 2N - 3$, which corresponds to the scenario where all $N$ channels are visited in turn from the $(k-N+1)$-th TP to the $k$-th TP with $L_{k-N+1} = L_{k-N+2} = \cdots = L_{k-2} = 2$. We thus have $\omega_k \ge T^{L_{k-1}+2N-3}(1)$. We then construct a hypothetical system given by the first-order Markov chain $\{L'_k\}_{k=1}^{\infty}$ with the following transition probabilities.

$$r_{i,j} = \begin{cases}
\dfrac{p_{01}}{T(p_{11})}\, T^{i+1}(1)(1-\epsilon) + 1 - \dfrac{p_{01}}{T(p_{11})}, & \text{if } i \text{ is odd},\ j = 1 \\[2mm]
(1 - T^{i+1}(1)(1-\epsilon))\,(1 - T(p_{11})(1-\epsilon))^{j-2}\, p_{01}(1-\epsilon), & \text{if } i \text{ is odd},\ j \ge 2 \\[2mm]
\dfrac{p_{01}}{T(p_{11})}\, T^{i+2N-3}(1)(1-\epsilon) + 1 - \dfrac{p_{01}}{T(p_{11})}, & \text{if } i \text{ is even},\ j = 1 \\[2mm]
(1 - T^{i+2N-3}(1)(1-\epsilon))\,(1 - T(p_{11})(1-\epsilon))^{j-2}\, p_{01}(1-\epsilon), & \text{if } i \text{ is even},\ j \ge 2
\end{cases}$$

The stationary distribution of this hypothetical system leads to the lower bound on $U$ given in (12). ■

For multi-channel sensing ($M > 1$), it is difficult to construct a first-order Markov process that stochastically dominates or is dominated by $\{L_k\}_{k=1}^{\infty}$. Instead, we establish a uniform statistical bound on the distributions of all TPs based on the structure of the myopic policy. The resulting bounds on the throughput, when applied to $M = 1$, are thus looser than those given in Theorem 2 for single-channel sensing.

Theorem 3: Recall the definition of $g \circ v(\cdot)$ given in (10). Under assumption A2, we have the following lower and upper bounds on throughput $U$ when $M > 1$.

• Case 1: $p_{11} > p_{01}$

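$$\frac{M c_3 (1-\epsilon)}{1 - (p_{11} - c_3)(1-\epsilon)} \le U \le \frac{M \omega_o (1-\epsilon)}{1 - (p_{11} - \omega_o)(1-\epsilon)}, \tag{26}$$

where $c_3 = \omega_o - \left(\omega_o - \frac{\epsilon p_{01}}{\epsilon p_{01} + 1 - p_{01}}\right)(p_{11} - p_{01})^{\lfloor N/M \rfloor}$.

• Case 2: $p_{11} < p_{01}$

$$M\, g \circ v_1(x_1, y_1, z_1, a_1, b_1) \le U \le M\, g \circ v_2\!\left(\frac{1}{x_1},\, 1 - z_1,\, 1 - y_1,\, a_1,\, b_1\right), \tag{27}$$

where

$$v_1(\cdot) = 1 - \left(\omega_o - (\omega_o - p_{11})(p_{11} - p_{01})^{2\lfloor N/M \rfloor - 2}\right)(1-\epsilon),$$

$$v_2(\cdot) = 1 - (p_{11}(p_{11} - p_{01}) + p_{01})(1-\epsilon),$$

$$x_1 = \frac{p_{01}}{p_{11}(p_{11} - p_{01}) + p_{01}},$$

$$y_1 = 1 - (p_{11}(p_{11} - p_{01}) + p_{01})(1-\epsilon),$$

$$z_1 = (1-\epsilon)\,p_{01}.$$

Note that $a_1$ and $b_1$ can be arbitrary since they are arguments of the constant functions $v_1$ and $v_2$.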

Proof: 

• Case 1: $p_{11} > p_{01}$

Consider first the upper bound. Similarly to single-channel sensing, the belief value $\omega_k$ of the chosen channel in the first slot of the $k$-th TP is upper bounded by $\omega_o$. $L_k(\omega_o)$ thus stochastically dominates $L_k(\omega_k)$, and the expectation of the former leads to the upper bound on $U$ given in (26).

We now consider the lower bound. Recall that $\omega_k = T^{J_k+1}\!\left(\frac{\epsilon x}{\epsilon x + 1 - x}\right)$, where $J_k$ denotes the number of consecutive slots in which the chosen channel has been unobserved since the last visit, and $x$ denotes the belief value of the chosen channel at the last time the user left it. Based on the structure of the myopic policy, the channel has the last priority when the user leaves it. It will take at least $\lfloor N/M \rfloor$ slots before the user returns to the same channel, i.e., $J_k \ge \lfloor N/M \rfloor - 1$. Based on the monotonic increasing property of the $j$-step belief update (see Fig. 3 (a)), we have $\omega_k \ge c_3$. $L_k(c_3)$ is thus stochastically dominated by $L_k(\omega_k)$, and the expectation of the former leads to the lower bound on $U$ given in (26).

• Case 2: $p_{11} < p_{01}$

Consider first the upper bound. Let $\omega_k$ denote the belief value of the chosen channel in the first slot of the $k$-th TP. Based on the structure of the myopic policy, we have $\omega_k = T^{J_k+1}(1)$, where $J_k$ denotes the number of consecutive slots in which the chosen channel has been unobserved since the last visit. From Fig. 3 (b), we have $\omega_k = T^{J_k+1}(1) \le T^2(1)$. Combined with the hypothetical system given in (24), $L'_k(T^2(1))$ is stochastically dominated by $L_k(\omega_k)$, and the expectation of the former leads to the upper bound on $U$ given in (27).

We now consider the lower bound. Recall that $\omega_k = T^{J_k+1}(1)$. If $J_k$ is odd, then $T^{J_k+1}(1) \ge T^{2\lfloor N/M \rfloor - 1}(1)$ since $2\lfloor N/M \rfloor - 1$ is an odd number (see Fig. 3 (b)). If $J_k$ is even, i.e., the user has stayed an even number of slots before it returns to this channel, then $J_k$ is at least $2\lfloor N/M \rfloor$, and we have $\omega_k = T^{J_k+1}(1) \ge T^{2\lfloor N/M \rfloor - 1}(1)$. Combined with the hypothetical system given in (25), $L'_k(T^{2\lfloor N/M \rfloor - 1}(1))$ stochastically dominates $L_k(\omega_k)$, and the expectation of the former leads to the lower bound on $U$ given in (27).

■ 

Corollary 2: For $p_{11} > p_{01}$, the lower bound on throughput $U$ increasingly converges to the constant upper bound at geometrical rate $(p_{11} - p_{01})^{1/M}$ as $N$ increases; for $p_{11} < p_{01}$, the lower bound on $U$ increasingly converges to a constant at geometrical rate $(p_{01} - p_{11})^{2/M}$.

Proof: From the closed-form expressions of the lower bounds on $U$ given in Theorem 2 and Theorem 3, it is easy to see that the lower bound is monotonically increasing with $N$. Let $x = |p_{11} - p_{01}|$. For $p_{11} > p_{01}$, after some simplifications, the lower bound has the form $a + b/(x^{\lfloor N/M \rfloor} + c)$, where $a, b, c$ ($c \ne 0$) are constants. The upper bound is $a + b/c$. We have

$$\frac{\left| a + b/(x^{\lfloor N/M \rfloor} + c) - a - b/c \right|}{x^{\lfloor N/M \rfloor}} \to \left| \frac{b}{c^2} \right| \quad \text{as } N \to \infty.$$

Thus the lower bound converges to the upper bound with geometric rate $x^{1/M}$.

For $p_{11} < p_{01}$, the lower bound has the form $d + e/(x^{2\lfloor N/M \rfloor - 1} + f)$, where $d, e, f$ ($f \ne 0$) are constants. It converges to $d + e/f$ as $N \to \infty$. We have

$$\frac{\left| d + e/(x^{2\lfloor N/M \rfloor - 1} + f) - d - e/f \right|}{x^{2\lfloor N/M \rfloor}} \to \left| \frac{e}{x f^2} \right| \quad \text{as } N \to \infty.$$

Thus the lower bound converges with geometric rate $x^{2/M}$. ■
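The geometric convergence for $p_{11} > p_{01}$ can be observed numerically. The sketch below evaluates both sides of (26) for increasing $N$ and reports how the gap shrinks each time $M$ more channels are added. It assumes that $c_3$ is the $\lfloor N/M \rfloor$-step belief update of $\epsilon p_{01}/(\epsilon p_{01} + 1 - p_{01})$, which is one reading of Theorem 3 consistent with its proof; parameter values and function names are illustrative.

```python
def case1_bound(omega, m, p11, eps):
    """Common form of both sides of (26), evaluated at initial belief omega."""
    return m * omega * (1.0 - eps) / (1.0 - (p11 - omega) * (1.0 - eps))


def c3(n, m, p11, p01, eps):
    """Assumed form of c3: T^{floor(N/M)} applied to eps*p01 / (eps*p01 + 1 - p01)."""
    omega_o = p01 / (1.0 - p11 + p01)
    start = eps * p01 / (eps * p01 + 1.0 - p01)
    return omega_o - (omega_o - start) * (p11 - p01) ** (n // m)


# Illustrative parameters (not taken from the paper's experiments).
p11, p01, eps, m = 0.8, 0.2, 0.03, 2
omega_o = p01 / (1.0 - p11 + p01)
upper = case1_bound(omega_o, m, p11, eps)

# Gap between the upper and lower bounds for N = 2, 4, ..., 40.
gaps = {n: upper - case1_bound(c3(n, m, p11, p01, eps), m, p11, eps)
        for n in range(2, 42, 2)}

# Adding M channels multiplies the gap by roughly (p11 - p01),
# i.e. by (p11 - p01)^(1/M) per added channel.
ratios = [gaps[n + m] / gaps[n] for n in range(30, 40, 2)]
```

The ratio is computed over steps of $M$ channels rather than single channels because the floor in $\lfloor N/M \rfloor$ makes the bound piecewise constant in $N$.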

The convergence of the lower bound to the upper bound when $p_{11} > p_{01}$ can be explained as follows. The upper bound given in Theorem 3 corresponds to the case where the belief in the first slot of a TP is equal to the stationary distribution $\omega_o$. If a user can always switch to channels with probability $\omega_o$ of being in the good state when channel switches are needed, the throughput will achieve the upper bound given in (26). Specifically, we have the following theorem, which gives the closed-form performance under the myopic policy over a finite horizon.

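Theorem 4: For $p_{11} > p_{01}$ and $N \ge MT$, under assumption A2, the expected total reward over $T$ slots when the initial belief starts from the stationary distribution is given below.

$$V_{1:T}(\Omega(1)) = M \omega_o (1-\epsilon) \left( \frac{T - 1}{1 - (p_{11} - \omega_o)(1-\epsilon)} + c_4 \right), \tag{28}$$

where

$$c_4 = 1 - \frac{\left((\omega_o - p_{11})(1-\epsilon)\right)^2 \left(1 - \left((p_{11} - \omega_o)(1-\epsilon)\right)^{T-1}\right)}{\left(1 - (p_{11} - \omega_o)(1-\epsilon)\right)^2}.$$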

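Proof: From the structure of the myopic policy, if the user observes state 1 from a channel, it will stay on that channel. Otherwise, it will switch to a new channel (with belief $\omega_o$). Clearly, $V$ does not depend on $N$ since at most $MT$ channels need to be considered during $T$ slots.

In the first slot, the user randomly chooses $M$ channels and gets $M\omega_o(1-\epsilon)$ units of reward. Then the user will either stay or switch on a channel. This process is a Markov chain with states “stay” and “switch” as shown below.

Fig. 4. The Markov chain with states “stay” and “switch”. Transition probabilities: stay to stay $p_{11}(1-\epsilon)$, stay to switch $1 - p_{11}(1-\epsilon)$, switch to stay $\omega_o(1-\epsilon)$, switch to switch $1 - \omega_o(1-\epsilon)$.

If the user observes 1 on a channel after the first slot, it will stay and get $p_{11}(1-\epsilon)$ units of reward on this channel. Otherwise it will switch to a new channel and get $\omega_o(1-\epsilon)$ units of reward. So $V$ is determined by the distribution of the states of the above two-state Markov chain.

$$\begin{aligned}
V(\omega_o, T) &= M\Bigg( \sum_{t=1}^{T-1} \big[\omega_o(1-\epsilon) \;\; 1 - \omega_o(1-\epsilon)\big] \begin{bmatrix} p_{11}(1-\epsilon) & 1 - p_{11}(1-\epsilon) \\ \omega_o(1-\epsilon) & 1 - \omega_o(1-\epsilon) \end{bmatrix}^{t-1} \begin{bmatrix} p_{11}(1-\epsilon) \\ \omega_o(1-\epsilon) \end{bmatrix} + \omega_o(1-\epsilon) \Bigg) \\
&= M\Bigg( \sum_{t=1}^{T-1} \big[\omega_o(1-\epsilon) \;\; 1 - \omega_o(1-\epsilon)\big] \Bigg\{ \frac{1}{1 - (p_{11} - \omega_o)(1-\epsilon)} \begin{bmatrix} \omega_o(1-\epsilon) & 1 - p_{11}(1-\epsilon) \\ \omega_o(1-\epsilon) & 1 - p_{11}(1-\epsilon) \end{bmatrix} \\
&\qquad\qquad + \frac{\left((p_{11} - \omega_o)(1-\epsilon)\right)^{t-1}}{1 - (p_{11} - \omega_o)(1-\epsilon)} \begin{bmatrix} 1 - p_{11}(1-\epsilon) & p_{11}(1-\epsilon) - 1 \\ -\omega_o(1-\epsilon) & \omega_o(1-\epsilon) \end{bmatrix} \Bigg\} \begin{bmatrix} p_{11}(1-\epsilon) \\ \omega_o(1-\epsilon) \end{bmatrix} + \omega_o(1-\epsilon) \Bigg) \\
&= M\Bigg( \frac{\omega_o(1-\epsilon)(T-1)}{1 - (p_{11} - \omega_o)(1-\epsilon)} - \frac{\omega_o(1-\epsilon)^3(\omega_o - p_{11})^2\left(1 - \left((p_{11} - \omega_o)(1-\epsilon)\right)^{T-1}\right)}{\left(1 - (p_{11} - \omega_o)(1-\epsilon)\right)^2} + \omega_o(1-\epsilon) \Bigg)
\end{aligned} \tag{29}$$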


From (29), we immediately see that the throughput $U$ is given as follows.

$$U = \lim_{T \to \infty} \frac{V(\omega_o, T)}{T} = \frac{M\omega_o(1-\epsilon)}{1 - (p_{11} - \omega_o)(1-\epsilon)}, \tag{30}$$

which agrees with the upper bound given in Theorem 3.
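The closed form (28) can be checked against a direct slot-by-slot recursion on the two-state “stay”/“switch” chain. In the sketch below, the per-slot success probability $r_t$ satisfies $r_1 = \omega_o(1-\epsilon)$ and $r_{t+1} = r_t\, p_{11}(1-\epsilon) + (1 - r_t)\,\omega_o(1-\epsilon)$; summing $r_t$ over $T$ slots and multiplying by $M$ gives the expected total reward. Function names and parameter values are illustrative.

```python
def total_reward_recursive(horizon, m, p11, omega_o, eps):
    """Expected total reward over `horizon` slots via the stay/switch recursion."""
    a = omega_o * (1.0 - eps)  # expected reward right after a switch
    q = p11 * (1.0 - eps)      # expected reward when staying
    r, total = a, 0.0
    for _ in range(horizon):
        total += r
        r = r * q + (1.0 - r) * a  # stay with probability r, switch otherwise
    return m * total


def total_reward_closed_form(horizon, m, p11, omega_o, eps):
    """Expected total reward according to (28)."""
    a = omega_o * (1.0 - eps)
    d = (p11 - omega_o) * (1.0 - eps)
    c4 = 1.0 - d ** 2 * (1.0 - d ** (horizon - 1)) / (1.0 - d) ** 2
    return m * a * ((horizon - 1) / (1.0 - d) + c4)


# Illustrative parameters (not taken from the paper's experiments).
p11, p01, eps, m = 0.8, 0.2, 0.03, 2
omega_o = p01 / (1.0 - p11 + p01)

rewards = {t: (total_reward_recursive(t, m, p11, omega_o, eps),
               total_reward_closed_form(t, m, p11, omega_o, eps))
           for t in (1, 2, 5, 50)}

# Long-run average reward, to be compared with (30).
long_run = total_reward_closed_form(100000, m, p11, omega_o, eps) / 100000
limit = m * omega_o * (1.0 - eps) / (1.0 - (p11 - omega_o) * (1.0 - eps))
```

The recursion uses the fact that the expected reward in a slot equals the probability of a successful observation in that slot, so the same quantity drives both the reward and the stay/switch transition.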

The monotonicity of the difference between the upper and lower bounds with respect to $N$ illustrates that the performance of the multi-channel opportunistic system improves with the number $N$ of channels, as suggested by intuition. For $p_{11} > p_{01}$, the upper bound gives the limiting performance of the opportunistic system when $N \to \infty$. By Corollary 2, the throughput of a multi-channel opportunistic system with single-channel sensing quickly saturates as the number of channels increases; it is thus crucial to enhance radio sensing capability in order to fully exploit the communication opportunities offered by a large number of channels.


IV.  Approximation  Factor  of  the  Myopic  Policy 

Although the optimality of the myopic policy is proved for $N = 2$ and conjectured for general scenarios based on numerical results, establishing the optimality or simple sufficient conditions for optimality appears to be challenging. Under the discounted reward criterion, we have shown that so long as the discount factor is less than $1/(M+1)$, the myopic policy is optimal for all $N$. In this section, we take a further step toward the optimality of the myopic policy. By considering a genie-aided system, we establish a bound on the performance loss of the myopic policy and its approximation factor with respect to the optimal policy.

A. A Genie-aided System

In the genie-aided system, we assume the user can sense, access, and obtain observations (ACK/NAK) from all $N$ channels. However, the user can only get reward from $M$ channels determined at the beginning of each slot. Clearly, the myopic policy (i.e., choose the $M$ channels with the largest probabilities of being in state 1 to accrue rewards) is optimal since the current choice will not affect the belief transitions. Similar to Corollary 1, the reward process of the genie-aided system is ergodic under assumption A2. Furthermore, we obtain an upper bound on the optimal performance of the genie-aided system.

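Theorem 5: Define $x = T\!\left(\frac{\epsilon p_{11}}{\epsilon p_{11} + 1 - p_{11}}\right)$. Under assumption A2, the maximum steady-state throughput $\bar{U}$ in the genie-aided system is upper bounded as given below.

• Case 1: $p_{11} > p_{01}$

$$\bar{U} \le \left( M p_{11} - \sum_{k=0}^{M} \binom{N}{k} d_k \right)(1-\epsilon), \tag{31}$$

where

$$d_k = (M - k)(p_{11} - x)(\omega_o(1-\epsilon))^{k}(1 - \omega_o(1-\epsilon))^{N-k}.$$

• Case 2: $p_{11} < p_{01}$

$$\bar{U} \le \left( M x - \sum_{k=0}^{M} \binom{N}{k} e_k \right)(1-\epsilon), \tag{32}$$

where

$$e_k = (M - k)(x - p_{11})(\omega_o(1-\epsilon))^{N-k}(1 - \omega_o(1-\epsilon))^{k}.$$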

Proof:

• Case 1: $p_{11} > p_{01}$

Based on the ergodicity of the reward process in the genie-aided system, the initial belief vector does not affect the optimal performance. Without loss of generality, assume the state of each channel starts from the stationary distribution $\omega_o$. As a consequence, the number $k$ of channels observed as 1 follows the binomial distribution $B(k, N, \omega_o(1-\epsilon))$ in every slot. Since the channels observed as 1 will have the largest belief value $p_{11}$ and other channels' belief values will be upper bounded by $T\!\left(\frac{\epsilon p_{11}}{\epsilon p_{11} + 1 - p_{11}}\right)$ in the next slot, the expected reward obtained under the myopic policy will be upper bounded by the right-hand side of (31). We have thus proved the upper bound on $\bar{U}$.

• Case 2: $p_{11} < p_{01}$

Similarly, we assume the state of each channel starts from the stationary distribution $\omega_o$ without loss of generality. The number $k$ of channels observed as 1 follows the binomial distribution $B(k, N, \omega_o(1-\epsilon))$ in every slot. Since the channels observed as 1 will have the smallest belief value $p_{11}$ and other channels' belief values will be upper bounded by $T\!\left(\frac{\epsilon p_{11}}{\epsilon p_{11} + 1 - p_{11}}\right)$ in the next slot, the expected reward obtained under the myopic policy will be upper bounded by the right-hand side of (32). We have thus proved the upper bound on $\bar{U}$. ■
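The counting argument in this proof can be made concrete with a short computation. The sketch below evaluates the right-hand sides of (31) and (32) and, separately, the expectation that the proof describes: the number of channels observed as 1 is binomial, the $M$ best available beliefs are collected, and the result is scaled by $(1-\epsilon)$. It assumes $x = T(\epsilon p_{11}/(\epsilon p_{11} + 1 - p_{11}))$, the bound on the other channels' beliefs used in the proof; function names and parameter values are illustrative.

```python
from math import comb


def belief_cap(p11, p01, eps):
    """Assumed x: one-step update of the post-failure belief starting from p11."""
    nak = eps * p11 / (eps * p11 + 1.0 - p11)
    return nak * p11 + (1.0 - nak) * p01


def genie_upper_bound(n, m, p11, p01, eps):
    """Right-hand side of (31) when p11 > p01, of (32) otherwise."""
    pi = (p01 / (1.0 - p11 + p01)) * (1.0 - eps)  # Pr[channel observed as 1]
    x = belief_cap(p11, p01, eps)
    if p11 > p01:
        shortfall = sum(comb(n, k) * (m - k) * (p11 - x) * pi ** k * (1.0 - pi) ** (n - k)
                        for k in range(m + 1))
        return (m * p11 - shortfall) * (1.0 - eps)
    shortfall = sum(comb(n, k) * (m - k) * (x - p11) * pi ** (n - k) * (1.0 - pi) ** k
                    for k in range(m + 1))
    return (m * x - shortfall) * (1.0 - eps)


def genie_direct_expectation(n, m, p11, p01, eps):
    """Expected sum of the M largest belief caps, averaged over the binomial count."""
    pi = (p01 / (1.0 - p11 + p01)) * (1.0 - eps)
    x = belief_cap(p11, p01, eps)
    best, other = (p11, x) if p11 > p01 else (x, p11)
    total = 0.0
    for ones in range(n + 1):  # number of channels observed as 1
        prob = comb(n, ones) * pi ** ones * (1.0 - pi) ** (n - ones)
        count_best = ones if p11 > p01 else n - ones
        top = min(count_best, m)
        total += prob * (top * best + (m - top) * other)
    return total * (1.0 - eps)


# Illustrative parameters (not taken from the paper's experiments).
bound_pos = genie_upper_bound(5, 2, 0.8, 0.2, 0.03)
direct_pos = genie_direct_expectation(5, 2, 0.8, 0.2, 0.03)
bound_neg = genie_upper_bound(5, 2, 0.2, 0.8, 0.03)
direct_neg = genie_direct_expectation(5, 2, 0.2, 0.8, 0.03)
```

Writing the bound as a maximum reward minus a binomial shortfall, as in (31) and (32), keeps the sum to at most $M + 1$ terms instead of $N + 1$.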

B.  Approximation  Factor 

Clearly, the optimal performance of the genie-aided system is an upper bound on the maximum throughput in the original multi-channel opportunistic access system. In other words, $\bar{U}$ provides a performance benchmark for all sensing policies, including the myopic policy. To better bound the performance of the myopic policy, we present another lower bound on the throughput $U$ under the myopic policy.

Theorem 6: Let $\underline{U}$ be the throughput under the random sensing policy that chooses $M$ out of $N$ channels with uniform probability (i.e., chooses any set of $M$ channels with probability $1/\binom{N}{M}$), and $U^*$ the maximum throughput under the optimal policy. We have

$$M\omega_o(1-\epsilon) = \underline{U} \le U \le U^* \le \bar{U} \le N\omega_o(1-\epsilon). \tag{33}$$

Proof: Since channels are stochastically identical, the random sensing policy is equivalent to the static policy that chooses a constant set of $M$ channels in each slot. Clearly, the long-run throughput of the static policy on a chosen channel is given by the stationary distribution $\omega_o$ multiplied by the probability $(1-\epsilon)$ of no false alarm.

To prove $\underline{U} \le U$, we note that the expected immediate reward under the random sensing policy in each slot is given by the expected sum of $M$ randomly chosen belief values under any given policy (including the myopic policy). Since the expected immediate reward under the myopic policy in each slot is given by the expected sum of the $M$ largest belief values, the throughput under the myopic policy is lower bounded by that under the random sensing policy.

The proof for $U \le U^* \le \bar{U}$ is trivial. To prove $\bar{U} \le N\omega_o(1-\epsilon)$, we note that $N\omega_o(1-\epsilon)$ is the throughput under the policy that senses and accrues rewards from all of the $N$ channels. ■

Combining the maximum of the lower bounds on $U$ given in Theorem 2, Theorem 3 and Theorem 6 and the minimum of the upper bounds on $\bar{U}$ given in Theorem 5 and Theorem 6, we obtain a uniform bound on the performance loss under the myopic policy. We further obtain the approximation factor of the myopic policy as given below.

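Corollary 3: Let $\eta$ ($\eta \in [0, 1]$) be the approximation factor of the myopic policy. Under assumption A2, we have

$$\eta \ge \begin{cases} \dfrac{M}{N}, & \text{if } p_{11} > p_{01} \\[2mm] \max\left\{ \dfrac{1}{2},\, \dfrac{M}{N} \right\}, & \text{if } p_{11} < p_{01} \\[2mm] 1, & \text{if } p_{11} = p_{01} \end{cases} \tag{34}$$

Proof: From Theorem 6, we directly see that $\eta \ge \frac{M}{N}$. Consider $p_{11} < p_{01}$. Based on Theorem 5, we have $\bar{U} \le M\, T\!\left(\frac{\epsilon p_{11}}{\epsilon p_{11} + 1 - p_{11}}\right)(1-\epsilon)$ (see the proof of Theorem 5). We thus have

$$\eta = \frac{U}{U^*} \ge \frac{U}{\bar{U}} \ge \frac{M\omega_o(1-\epsilon)}{M\, T\!\left(\frac{\epsilon p_{11}}{\epsilon p_{11} + 1 - p_{11}}\right)(1-\epsilon)} \ge \frac{1}{1 + p_{01} - p_{11}} \ge \frac{1}{2}.$$

For the trivial case $p_{11} = p_{01}$, we note that the lower bound on $U$ given in Theorem 3 agrees with the upper bound on $\bar{U}$ given in Theorem 5. ■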
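The chain of inequalities behind the factor $1/2$ can be checked on a grid of parameters. The sketch below computes the ratio $\omega_o / T(\epsilon p_{11}/(\epsilon p_{11} + 1 - p_{11}))$ for $p_{11} < p_{01}$ and collects any grid point where it falls below $1/(1 + p_{01} - p_{11})$ or where that quantity falls below $1/2$. The grid, the tolerance, and the function name are illustrative choices; the check covers only the last two inequalities of the proof, not assumption A2 itself.

```python
def ratio_lower_bound(p11, p01, eps):
    """omega_o divided by the cap T(eps*p11 / (eps*p11 + 1 - p11)) on other beliefs."""
    omega_o = p01 / (1.0 - p11 + p01)
    nak = eps * p11 / (eps * p11 + 1.0 - p11)
    cap = nak * p11 + (1.0 - nak) * p01
    return omega_o / cap


grid = [i / 20.0 for i in range(1, 20)]
violations = []
for p11 in grid:
    for p01 in grid:
        if p11 >= p01:
            continue  # only the negatively correlated case is considered here
        for eps in (0.0, 0.05, 0.2):
            floor_value = 1.0 / (1.0 + p01 - p11)
            if ratio_lower_bound(p11, p01, eps) < floor_value - 1e-12 or floor_value < 0.5:
                violations.append((p11, p01, eps))
```

An empty `violations` list is the expected outcome, since the cap never exceeds $p_{01}$ when $p_{11} < p_{01}$ and $\omega_o / p_{01} = 1/(1 + p_{01} - p_{11})$.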


V.  Numerical  Examples 

In this section, we demonstrate the tightness of the bounds on $U$ given in Sec. III-C2 and Sec. IV-A. In particular, we are interested in the lower and upper bounds on the performance of the myopic policy given in Theorem 2 and Theorem 3, and the upper bound on the optimal performance in the genie-aided system given in Theorem 5. We also generate the performance of the myopic policy and the optimal performance in the genie-aided system by Monte Carlo simulations. Fig. 5 illustrates the bounds on the performance of the myopic policy under single-channel sensing. Fig. 6 illustrates the bounds on the performance of the myopic policy under multi-channel sensing ($M = 2$). We observe that the lower bound on the performance of the myopic policy quickly converges to the upper bound as $N \to \infty$ when channels are positively correlated. We also observe from Figs. 5 and 6 that the upper bound on the optimal performance in the genie-aided system is tight.

Fig. 5. Performance bounds of the myopic policy ($M = 1$, $p_{11} = 0.8$, $p_{01} = 0.2$, $\epsilon = 0.0312$).


VI.  Conclusion  and  Future  Work 

In  this  paper,  we  have  analyzed  the  performance  of  the  myopic  sensing  policy  in  multi-channel 
opportunistic  access  under  an  independent  and  stochastically  identical  Gilbert-Elliot  channel 
model  with  noisy  state  observations.  Based  on  the  conjectured  optimality  of  the  myopic  sensing 
policy,  the  obtained  analytical  results  allow  us  to  systematically  examine  the  impact  of  the  number 
of  channels  and  channel  dynamics  (transition  probabilities)  on  the  system  performance.  An 
approximation  factor  of  the  myopic  policy  has  been  established.  Future  work  includes  proving  the 
optimality  conjecture  of  the  myopic  policy,  and  generalization  to  independent  and  stochastically 
non-identical  channel  model  by  investigating  Whittle’s  index  policy. 

Fig. 6. Performance bounds of the myopic policy ($M = 2$, $p_{11} = 0.8$, $p_{01} = 0.2$, $\epsilon = 0.0312$).

TECHNICAL REPORT TR-09-01, UC DAVIS, MARCH, 2009.

References

[1] E. N. Gilbert, “Capacity of burst-noise channels,” Bell Syst. Tech. J., vol. 39, pp. 1253-1265, Sept. 1960.

[2] Q. Zhao and B. Sadler, “A Survey of Dynamic Spectrum Access,” IEEE Signal Processing Magazine, vol. 24, pp. 79-89, May 2007.

[3] Q. Zhao, L. Tong, A. Swami, and Y. Chen, “Decentralized Cognitive MAC for Opportunistic Spectrum Access in Ad Hoc Networks: A POMDP Framework,” in IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, pp. 589-600, April, 2007.

[4]  C.  H.  Papadimitriou  and  J.  N.  Tsitsiklis,  “The  Complexity  of  Optimal  Queueing  Network  Control,”  in  Mathematics  of 
Operations  Research,  Vol.  24,  No.  2,  May  1999,  pp.  293-305. 

[5]  Q.  Zhao,  B.  Krishnamachari,  and  K.  Liu,  “On  Myopic  Sensing  for  Multi-Channel  Opportunistic  Access:  Structure, 
Optimality,  and  Performance,”  in  IEEE  Transactions  on  Wireless  Communications,  vol.  7,  no.  12,  pp.  5431-5440,  December, 
2008. 

[6] S. H. Ahmad, M. Liu, T. Javidi, Q. Zhao and B. Krishnamachari, “Optimality of Myopic Sensing in Multi-Channel
Opportunistic  Access,”  submitted  to  IEEE  Transactions  on  Information  Theory,  May,  2008. 

[7] K. Liu and Q. Zhao, “Indexability of Restless Bandit Problems and Optimality of Whittle’s Index for Dynamic Multichannel Access,” submitted to IEEE Transactions on Information Theory, November, 2008. Available at http://arxiv.org/abs/0810.4658 (conference versions appeared in Proc. of the 5th IEEE Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON) Workshops, June, 2008 and Proc. of IEEE Asilomar Conference on Signals, Systems, and Computers, October, 2008).

[8]  Y.  Chen,  Q.  Zhao,  and  A.  Swami,  “Joint  Design  and  Separation  Principle  for  Opportunistic  Spectrum  Access  in  the  Presence 
of  Sensing  Errors,”  in  IEEE  Transactions  on  Information  Theory,  vol.  54,  no.  5,  pp.  2053-2071,  May,  2008. 

[9]  Q.  Zhao,  B.  Krishnamachari,  and  K.  Liu,  “Low-Complexity  Approaches  to  Spectrum  Opportunity  Tracking,”  in  Proc.  of 
the  2nd  International  Conference  on  Cognitive  Radio  Oriented  Wireless  Networks  and  Communications  ,  August  2007. 

[10] R. Smallwood and E. Sondik, “The optimal control of partially observable Markov processes over a finite horizon,”
Operations  Research,  pp.  1071-1088,  1971. 


